
    High-dimensional cluster analysis with the masked EM algorithm

    Cluster analysis faces two problems in high dimensions: the "curse of dimensionality," which can lead to overfitting and poor generalization performance, and the sheer time taken for conventional algorithms to process large amounts of high-dimensional data. We describe a solution to these problems, designed for the application of spike sorting for next-generation, high-channel-count neural probes. In this problem, only a small subset of features provides information about the cluster membership of any one data vector, but this informative feature subset is not the same for all data points, rendering classical feature selection ineffective. We introduce a "masked EM" algorithm that allows accurate and time-efficient clustering of up to millions of points in thousands of dimensions. We demonstrate its applicability to synthetic data and to real-world high-channel-count spike-sorting data.
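    To make the idea concrete, here is a minimal sketch of a masked E-step, assuming a diagonal-covariance Gaussian mixture in which each masked-out feature is replaced by a fixed per-dimension noise distribution. This illustrates the general idea only, not the authors' implementation; all names, shapes, and the diagonal-Gaussian simplification are assumptions:

```python
import numpy as np

def masked_e_step(X, M, means, variances, weights, noise_mean, noise_var):
    """One E-step of a diagonal-Gaussian mixture in which masked-out
    features (M == 0) are replaced by a fixed per-dimension noise
    distribution, in the spirit of a masked EM algorithm.

    X: (n, d) data; M: (n, d) binary mask (1 = feature informative);
    means, variances: (k, d) component parameters; weights: (k,) mixing
    proportions; noise_mean, noise_var: (d,) noise distribution."""
    # Virtual data: the observed value where the mask is on, the noise
    # mean where it is off.
    Y = M * X + (1 - M) * noise_mean
    # Substituting a distribution (rather than a point value) for masked
    # features adds its variance to each component's variance.
    V = (1 - M) * noise_var
    n, k = X.shape[0], means.shape[0]
    log_r = np.empty((n, k))
    for j in range(k):
        tv = variances[j] + V  # (n, d) total variance per feature
        log_r[:, j] = (np.log(weights[j])
                       - 0.5 * np.sum(np.log(2 * np.pi * tv)
                                      + (Y - means[j]) ** 2 / tv, axis=1))
    log_r -= log_r.max(axis=1, keepdims=True)  # numerical stabilization
    R = np.exp(log_r)
    return R / R.sum(axis=1, keepdims=True)  # responsibilities
```

    Because masked features contribute only fixed noise terms, most of the per-point work scales with the number of unmasked features rather than the full dimensionality, which is where the time savings come from.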

    GP-Select: Accelerating EM using adaptive subspace preselection

    We propose a nonparametric procedure to achieve fast inference in generative graphical models when the number of latent states is very large. The approach is based on iterative latent variable preselection, where we alternate between learning a selection function to reveal the relevant latent variables and using it to obtain a compact approximation of the posterior distribution for EM. This can make inference possible where the number of possible latent states is, for example, exponential in the number of latent variables, a regime in which an exact approach would be computationally infeasible. We learn the selection function entirely from the observed data and the current expectation-maximization state via Gaussian process regression. This is in contrast to earlier approaches, where selection functions were manually designed for each problem setting. We show that our approach performs as well as these bespoke selection functions on a wide variety of inference problems. In particular, for the challenging case of a hierarchical model for object localization with occlusion, we achieve results that match a customized state-of-the-art selection method at a far lower computational cost.
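    A minimal sketch of the preselection step, assuming scikit-learn's GaussianProcessRegressor and taking as given per-latent relevance targets derived from the current EM state; the function name and shapes are illustrative, not the authors' API:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def preselect_latents(X, relevance_targets, K):
    """Learn a selection function mapping each observation to per-latent
    relevance scores via GP regression, then keep the K most relevant
    latent variables for each data point.

    X: (n, d) observations; relevance_targets: (n, H) relevance signals
    from the current EM state (e.g. expected activations); returns an
    (n, K) array of selected latent-variable indices."""
    gp = GaussianProcessRegressor()
    gp.fit(X, relevance_targets)   # multi-output GP regression
    scores = gp.predict(X)         # (n, H) predicted relevances
    return np.argsort(-scores, axis=1)[:, :K]
```

    EM then evaluates the posterior only over configurations of the selected latent variables for each data point, shrinking the effective state space from exponential in the number of latents to a manageable size.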

    Improvements to the sequence memoizer

    The sequence memoizer is a model for sequence data with state-of-the-art performance on language modeling and compression. We propose a number of improvements to the model and its inference algorithm, including an enlarged range of hyperparameters, a memory-efficient representation, and inference algorithms operating on the new representation. Our derivations are based on precise definitions of the various processes involved, which also allow us to give an elementary proof of the "mysterious" coagulation and fragmentation properties used in the original paper on the sequence memoizer by Wood et al. (2009). We present experimental results supporting our improvements.
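    For reference, the coagulation (marginalization) property in question, for the zero-concentration Pitman-Yor processes used in the sequence memoizer, is usually stated as follows; this is the standard form of the result, not a quotation from the paper:

```latex
% Coagulation for zero-concentration Pitman-Yor (PY) hierarchies:
G_1 \mid G_0 \sim \mathrm{PY}(d_1,\, 0,\, G_0), \qquad
G_2 \mid G_1 \sim \mathrm{PY}(d_2,\, 0,\, G_1)
\;\Longrightarrow\;
G_2 \mid G_0 \sim \mathrm{PY}(d_1 d_2,\, 0,\, G_0).
```

    A chain of such nodes therefore collapses to a single Pitman-Yor node whose discount is the product of the discounts along the chain; fragmentation is the inverse operation, reinstating a marginalized node when inference requires it.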

    A Stochastic Memoizer for Sequence Data

    We propose an unbounded-depth, hierarchical, Bayesian nonparametric model for discrete sequence data. This model can be estimated from a single training sequence, yet shares statistical strength between subsequent symbol predictive distributions in such a way that predictive performance generalizes well. The model builds on a specific parameterization of an unbounded-depth hierarchical Pitman-Yor process. We introduce analytic marginalization steps (using coagulation operators) to reduce this model to one that can be represented in time and space linear in the length of the training sequence. We show how to perform inference in such a model without truncation approximation and introduce the fragmentation operators necessary for predictive inference. We demonstrate the sequence memoizer by using it as a language model, achieving state-of-the-art results.
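    Concretely, the predictive probability of a symbol s after context u in a hierarchical Pitman-Yor model takes the familiar back-off form below; the notation (customer counts c, table counts t, discount d, concentration θ, suffix context σ(u)) is standard Chinese-restaurant bookkeeping rather than anything specific to this paper:

```latex
P(s \mid u) \;=\;
\frac{c_{us} - d_{|u|}\, t_{us}}{\theta_{|u|} + c_{u\cdot}}
\;+\;
\frac{\theta_{|u|} + d_{|u|}\, t_{u\cdot}}{\theta_{|u|} + c_{u\cdot}}\;
P\bigl(s \mid \sigma(u)\bigr).
```

    In the sequence memoizer the concentrations θ are set to zero, which is precisely what enables the coagulation-based marginalization of non-branching contexts noted above.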

    The Sequence Memoizer

    Probabilistic models of sequences play a central role in machine translation, automated speech recognition, lossless compression, spell-checking, and gene identification applications, to name but a few. Unfortunately, real-world sequence data often exhibit long-range dependencies that can only be captured by computationally challenging, complex models. Sequence data arising from natural processes also often exhibit power-law properties, yet common sequence models do not capture them. The sequence memoizer is a new hierarchical Bayesian model for discrete sequence data that captures long-range dependencies and power-law characteristics while remaining computationally attractive. Its utility as a language model and general-purpose lossless compressor is demonstrated.
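    The power-law claim can be made concrete with a standard property of the Pitman-Yor process (stated here for context, not quoted from the paper): with discount 0 < d < 1, the number K_n of distinct symbols among n draws grows polynomially,

```latex
K_n / n^{d} \;\xrightarrow{\ \text{a.s.}\ }\; S_{d,\theta} \quad (n \to \infty),
```

    for a strictly positive random limit S_{d,\theta}. The model thus reproduces the Zipf-like type growth seen in natural language, whereas a Dirichlet process (d = 0) yields only logarithmic growth.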

    A Model-Based Spike Sorting Algorithm for Removing Correlation Artifacts in Multi-Neuron Recordings

    Jonathan W. Pillow is with UT Austin; Jonathon Shlens and E. J. Chichilnisky are with the Salk Institute; Eero P. Simoncelli is with New York University.

    We examine the problem of estimating the spike trains of multiple neurons from voltage traces recorded on one or more extracellular electrodes. Traditional spike-sorting methods rely on thresholding or clustering of recorded signals to identify spikes. While these methods can detect a large fraction of the spikes in a recording, they generally fail to identify synchronous or near-synchronous spikes: cases in which multiple spikes overlap. Here we investigate the geometry of failures in traditional sorting algorithms and document the prevalence of such errors in multi-electrode recordings from primate retina. We then develop a method for multi-neuron spike sorting using a model that explicitly accounts for the superposition of spike waveforms. We model the recorded voltage traces as a linear combination of spike waveforms plus a stochastic background component of correlated Gaussian noise. Combining this measurement model with a Bernoulli prior over binary spike trains yields a posterior distribution for spikes given the recorded data. We introduce a greedy algorithm, which we call “binary pursuit”, to maximize this posterior. The algorithm allows modest variability in spike waveforms and recovers spike times with higher precision than the voltage sampling rate. This method substantially corrects the cross-correlation artifacts that arise with conventional methods and substantially outperforms clustering methods on both real and simulated data. Finally, we develop diagnostic tools that can be used to assess errors in spike sorting in the absence of ground truth.

    This work was supported by: Royal Society USA/Canada Research Fellowship (JWP) (http://royalsociety.org/grants/); Center for Perceptual Systems startup funding (JP) (http://www.utexas.edu/cola/centers/cps/); Sloan Research Fellowship (JWP) (http://www.sloan.org/); Miller Institute for Basic Research in Science (JS) (http://millerinstitute.berkeley.edu/); National Eye Institute (NEI) grant EY018003 (EJC, EPS); National Institutes of Health (NIH) grant EY017736 (EJC); and Howard Hughes Medical Institute (EPS) (http://www.hhmi.org/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
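    The greedy step lends itself to a compact sketch. The following illustrates the binary-pursuit idea only, assuming whitened unit-variance noise and omitting the waveform variability and sub-sample spike timing handled by the actual method; all names are hypothetical:

```python
import numpy as np

def binary_pursuit(v, templates, p):
    """Greedy MAP inference of binary spike trains, assuming whitened
    (unit-variance, uncorrelated) noise for simplicity.

    v: (T,) voltage trace; templates: (K, L) spike waveforms;
    p: prior spike probability per time bin (p < 0.5)."""
    residual = v.copy()
    prior_penalty = np.log(p / (1 - p))        # log-odds cost of a spike
    energy = 0.5 * np.sum(templates ** 2, axis=1)
    spikes = []
    while True:
        best_spike, best_gain = None, 0.0
        for k, w in enumerate(templates):
            # Change in log posterior from adding template k at each lag:
            # <residual, w> - ||w||^2 / 2 + prior log-odds.
            corr = np.correlate(residual, w, mode="valid")
            gain = corr - energy[k] + prior_penalty
            t = int(np.argmax(gain))
            if gain[t] > best_gain:
                best_spike, best_gain = (k, t), gain[t]
        if best_spike is None:                 # no spike improves posterior
            break
        k, t = best_spike
        residual[t:t + templates.shape[1]] -= templates[k]
        spikes.append((k, t))
    # Gains strictly decrease after each subtraction, so the loop
    # terminates; the binary one-spike-per-bin constraint is not
    # explicitly enforced in this sketch.
    return spikes
```

    Subtracting each accepted template from the residual is what lets the method pull apart superimposed waveforms that clustering-based sorters conflate.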